Stochastic Contextual Bandits with Long Horizon Rewards
Authors
Abstract
The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most s prior actions and contexts (not necessarily consecutive), up to a time horizon of h. In order to avoid polynomial dependence on h, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor (T ≤ h) and data-rich (T > h) regimes and derive respective regret upper bounds Õ(d√(sT) + min{q, T}) and Õ(√(sdT)), with sparsity s, feature dimension d, total time horizon T, and q that is adaptive to the dependence pattern. Complementing the upper bounds, we also show that learning from a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds, and the sample complexity indeed benefits from the sparse reward structure. Our results necessitate a new analysis to address long-range temporal dependencies across the data. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish guarantees that are of independent interest.
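As a concrete, hypothetical illustration of the reward model described in the abstract (a reward that depends on at most s prior, not necessarily consecutive, action contexts within a horizon h), the following sketch rolls out such a process under a random placeholder policy. All names, sizes, and the noise level are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: feature dim d, horizon h, sparsity s, K arms.
d, h, s, K = 5, 50, 3, 4
theta = rng.standard_normal(d)                # unknown arm parameter
lags = rng.choice(h, size=s, replace=False)   # sparse dependence pattern

def reward(past_contexts, t):
    """Noisy reward summing <theta, x_{t-j}> over the s sparse lags j."""
    total = 0.0
    for j in lags:
        if t - j >= 0:
            total += theta @ past_contexts[t - j]
    return total + 0.1 * rng.standard_normal()

# Roll out T rounds with uniformly random arm choices.
T = 20
contexts = rng.standard_normal((T, K, d))     # per-round arm contexts
chosen = np.empty((T, d))                     # context of the pulled arm
rewards = []
for t in range(T):
    a = rng.integers(K)                       # placeholder policy
    chosen[t] = contexts[t, a]
    rewards.append(reward(chosen, t))
print(len(rewards))
```

The point of the sketch is only the data-generating process: each observed reward mixes information from several past rounds, which is what forces the long-range dependence analysis the abstract refers to.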
Similar resources
Contextual Bandits with Stochastic Experts
We consider the problem of contextual bandits with stochastic experts, which is a variation of the traditional stochastic contextual bandit with experts problem. In our problem setting, we assume access to a class of stochastic experts, where each expert is a conditional distribution over the arms given a context. We propose upper-confidence bound (UCB) algorithms for this problem, which employ...
Nonparametric Stochastic Contextual Bandits
We analyze the K-armed bandit problem where the reward for each arm is a noisy realization based on an observed context under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of Õ(T^((1+D)/(2+D))), where D is the context dimension, for a modified UCB algorithm that is simple to implement (kNN-UCB). We then give global intrinsic dimension dep...
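A minimal sketch of the kNN-UCB idea mentioned above: estimate each arm's reward from the k nearest past contexts and add an optimism bonus. The selection rule, bonus form, and all parameters here are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_ucb_choice(history, x, K, k=5, alpha=1.0):
    """Pick an arm by a k-nearest-neighbour reward estimate plus a bonus.

    history: list of (context, arm, reward) tuples observed so far.
    x: current context (1-D array). Hypothetical sketch, not the
    paper's exact rule.
    """
    scores = []
    for a in range(K):
        pts = [(c, r) for (c, arm, r) in history if arm == a]
        if len(pts) < k:
            return a                      # explore under-sampled arms first
        dists = np.array([np.linalg.norm(c - x) for c, _ in pts])
        idx = np.argsort(dists)[:k]
        mean = np.mean([pts[i][1] for i in idx])
        # Bonus: sampling uncertainty plus a bias term for neighbour distance.
        bonus = alpha * (np.sqrt(np.log(len(history) + 1) / k) + dists[idx].max())
        scores.append(mean + bonus)
    return int(np.argmax(scores))

# Tiny simulation: arm a pays coordinate a of a 2-D context, plus noise.
K, T = 2, 200
history = []
for t in range(T):
    x = rng.uniform(size=2)
    a = knn_ucb_choice(history, x, K)
    r = x[a] + 0.05 * rng.standard_normal()
    history.append((x, a, r))
print(len(history))
```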
Stochastic Contextual Bandits with Known Reward Functions
Many sequential decision-making problems in communication networks such as power allocation in energy harvesting communications, mobile computational offloading, and dynamic channel selection can be modeled as contextual bandit problems which are natural extensions of the well-known multi-armed bandit problem. In these problems, each resource allocation or selection decision can make use of ava...
Linear Contextual Bandits with Knapsacks
We consider the linear contextual bandit problem with resource consumption, in addition to reward generation. In each round, the outcome of pulling an arm is a reward as well as a vector of resource consumptions. The expected values of these outcomes depend linearly on the context of that arm. The budget/capacity constraints require that the total consumption doesn’t exceed the budget for each ...
Contextual Bandits with Similarity Information
In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs t...
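For the small-strategy-set case this blurb calls well understood, the classical UCB1 index rule can be sketched as follows; the arm means and round count below are illustrative assumptions.

```python
import math
import random

random.seed(0)

def ucb1_select(pulls, sums, t):
    """Classical UCB1: play each arm once, then maximise mean + bonus."""
    for a, n in enumerate(pulls):
        if n == 0:
            return a
    return max(range(len(pulls)),
               key=lambda a: sums[a] / pulls[a]
               + math.sqrt(2.0 * math.log(t) / pulls[a]))

means = [0.3, 0.5, 0.7]            # illustrative Bernoulli arms
pulls = [0] * len(means)
sums = [0.0] * len(means)
for t in range(1, 2001):
    a = ucb1_select(pulls, sums, t)
    r = 1.0 if random.random() < means[a] else 0.0
    pulls[a] += 1
    sums[a] += r
print(pulls)
```

Over enough rounds the best arm (index 2 here) accumulates most of the pulls, while each suboptimal arm is pulled only logarithmically often.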
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i8.26140